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Abstract 



We discuss the role of automatic difFerentiation tools in optimization software. We 
emphasize issues that are important to large-scale optimization and that have proved 
useful in the installation of nonlinear solvers in the NEOS Server. Our discussion centers 
on the computation of the gradient and Hessian matrix for partially separable functions 
and shows that the gradient and Hessian matrix can be computed with guaranteed 
bounds in time and memory requirements. 

1 Introduction 

Despite advances in automatic differentiation algorithms and software, researchers disagree 
on the value of incorporating automatic differentiation tools in optimization software. There 
are various reasons for this state of affairs. An important reason seems to be that little 
published experience exists on the effect of automatic differentiation tools on realistic prob- 
lems, and thus users worry that automatic differentiations tools are not applicable to their 
problems or are too expensive in terms of time or memory. Whatever the reasons, few 
optimization codes incorporate automatic differentiation tools. 

Without question, incorporating automatic difFerentiation tools into optimization is not 
only useful but, in many cases, essential in order to promote the widespread use of state- 
of-the-art optimization software. For example, a Newton method for the solution of large 
bound-constrained problems 



where / : i— > M and xi and Xu define the bounds on the variables, requires that the user 
provide procedures for evaluating the function f{x) and also the gradient V/(x), the sparsity 
pattern of the Hessian matrix V^/(a;), and the Hessian matrix V^/(x). The demands on 
the user increase for the constrained optimization problem 



where c : M"" are the nonlinear constraints. In this case the user must also provide 

the sparsity pattern and the Jacobian matrix (/{x) of the constraints. In some cases the 
user may even be asked to provide the Hessian matrix of the Lagrangian 
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of the optimization problem. The time and effort required to obtain this information and 
verify their correctness can be large even for simple problems. Clearly, any help in simpli- 
fying this effort would promote the use of the software. 

In spite of the advantages offered by automatic differentiation tools, relatively little ef- 
fort has been made to interface optimization software with automatic differentiation tools. 
Dixon [H, in was an early proponent of the integration of automatic differentiation with 
optimization, but to our knowledge Liu and Tits [24| were the first to provide interfaces 
between a general nonlinear constrained optimization solver (FSQP) and automatic differ- 
entiation tools (ADIFOR). 

Modeling languages for optimization (for example, AMPL and GAMS |jl^) pro- 
vide environments for solving optimization problems that deserve emulation. These envi- 
ronments package the ability to calculate derivatives, together with state-of-the-art opti- 
mization solvers and a language that facilitates modeling, to yield an extremely attractive 
problem-solving environment. 

The NEOS Server for Optimization is another problem-solving environment that 
integrates automatic differentiation tools and state-of-the-art optimization solvers. Users 
choose a solver and submit problems via the Web, email (aeos@nics.anl.gov), or a Java- 
enabled submission tool. When a submission arrives, NEOS parses the submission data and 
relays that data to a computer associated with the solver. Once results are obtained, they 
are sent to NEOS, which returns the results to the user. Submissions specified in Fortran 
are processed by ADIFOR |^, while C submissions are handled by ADOL-C Since 
the initial release in 1995, the NEOS Server has continued to add nonlinear optimization 
solvers with an emphasis on large-scale problems, and the current version contains more 
than a dozen different nonlinear optimization solvers. 

Users of a typical computing environment would like to solve optimization problems 
while only requiring that the user provide a specification of the problem; all other quantities 
required by the software (for example, gradients, Hessians, and sparsity patterns) would be 
generated automatically. Optimization modeling languages and the NEOS Server provide 
this ability, but as noted above, users of nonlinear optimization solvers are usually asked to 
provide derivative information. 

Our goal in this paper is to discuss techniques for using automatic differentiation tools 
in large-scale optimization software. We highlight issues that are relevant to solvers in the 
NEOS Server. For recent work on the interface between automatic differentiation tools and 
large-scale solvers, see [|^, We pay particular attention to the computation of second- 
order (Hessian) information since there is evidence that the use of second-order information 
is crucial to the solution of large-scale problems. The main concern is the cost of obtaining 
second-order information. See |2|, for related work. 

We note that at present most optimization software for large-scale problems use only 
first-order derivatives. Indeed, of the nonlinear solvers available in the NEOS Server, only 
LANCELOT, LOQO, and TRON accept second-order information. We expect this situation 
to change, however, as automatic differentiation tools improve and provide second-order 
information with the same reliability and efficiency as are currently available for first-order 
information. 
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2 Partially Separable Functions 



We consider the computation of the gradient and Hessian matrix of a partially separable 
function, that is, a function / : R" i— >• M of the form 

m 

f{x) = Y,fkix)^ (2-1) 

k=l 

where the component functions fk '■ M"" M are such that the extended function 

/ fiix) 



\ fm{x) 



has a sparse Jacobian matrix. Our techniques are geared to the solution of large-scale 
optimization problems. For an extensive treatment of techniques for computing deriva- 
tives of general and partially separable functions with automatic differentiation tools, we 
recommend the recent book by Griewank pC|]. 

Partially separable functions were introduced by Griewank and Toint |22]. They showed, 
in particular, that / : M" i— > R is partially separable if and only if the Hessian matrix 
V^/(x) is sparse. Partially separable functions also arise in systems of nonlinear equations 
and nonlinear least squares problems. For example, if each component of the mapping 
r : R" 1-^ R™ is partially separable, then 

/(x) = i||r(x)f 

is also partially separable. As another example, consider the constrained optimization 
problem 

min{/(x) : xi < X < Xu, Q < c{x) < c^} , 

where c : R" i— > R™ specifies the constraints. For this problem, the Lagrangian function 
L{-,u) defined by (|1.1| ) is partially separable if / and all the components of the mapping 
c are partially separable. For specific examples note that the functions / and c in the 
parameter estimation and optimal control optimization problems in the COPS |T7| collection 
are partially separable. 

We are interested in computing the gradient and the Hessian of a partially separable 
function with guaranteed bounds in terms of both computing time and memory require- 
ments. We require that the computing time be bounded by a multiple of the computing 
time of the function, that is, 

r{V/(x)} < n^^a T{fix)}, T{V2/(x)} < T{fix)}, (2.2) 

for constants i^T,G and ^It,h7 where T{-} is computing time. We also require that 

M{Vf{x)} < n,,,a M{fix)}, M{VV(x)} < Qm^h M{f{x)} (2.3) 

for constants ^Im.g and J^a/,//, where M{-} is memory. 
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These are important requirements for large-scale problems. In particular, if the constants 
in these expressions are small and independent of the structure of the extended function f^, 
then the computational requirements of an iteration of Newton's method are comparable 
with those of a limited-memory Newton's method. 



The constants in (2.2) and (|2.3|) can be bounded in terms of a measure of the sparsity 



of the extended function. We use Pm, where 

Pm = max{/)j}, 

and Pi is the number of nonzeros in the ith row of fj(x). We can also view p^/ as the 
largest number of variables in any of the component functions. 

Decompositions with the number m of element functions of order n, and with Pm 

small and independent of n, are preferred. Since the number of nonzeros in the Hessian 
V'^f{x) is no more than mpM, decompositions with these properties are guaranteed to have 
sparse Hessian matrices. Discretizations of parameter estimation and optimal control prob- 
lems, for example, have these properties because in these problems each element function 
represents the contributions from an interval or an element in the discretization. 

One of the aims of this paper is to present numerical evidence that we can compute the 
gradient Vf{x) and the Hessian matrix V'^f{x) of a partially separable function with 

^T,G < kiPm, ^t,h < f^2p'ii, (2.4) 

where ki and K2 are constants of modest size and independent of /g. We normalize I^t.g by 
Pm because the techniques in Section |3| require at least p^ functions evaluations to estimate 
the gradient. Similarly, the number of gradient evaluations needed to estimate the Hessian 
matrix by the techniques in Section ^ is at least Pm- Thus, these techniques require at least 
p\j function evaluations to estimate the Hessian matrix. 

3 Computing Gradients 

We now outline the techniques that we use for computing the gradients of partially separable 
functions. For additional information on the techniques in this section, see 

Computing the gradient of a partially separable function so that the bounds (|2.2D and 



on 



(2.3) are satisfied is based on the observation, due to Andreas Griewank, that if / 
is partially separable, then 

f{x) = Uxfe, 
where e € M™' is the vector of all ones, and hence 

Vfix) = h'ixfe. (3.1) 

We can then compute the gradient by computing the Jacobian matrix fJix). 

At first sight the approach based on ( |3.1| ) does not look promising, since we need to 
compute a Jacobian matrix and then obtain the gradient from a matrix- vector product. 
However, the key observation is that the Jacobian matrix is sparse, while the gradient is 
dense. Thus, we can use sparse techniques for the computation of the extended Jacobian. 



4 



We could also use the reverse approach of automatic differentiation to compute the 
gradient of /. The reverse approach works directly on / and does not require the partial 
separability structure of /. Moreover, for the reverse approach, ( |2.2D holds with Ot,g small 
and independent of Pm- Theoretically VtT,G < 5, but practical implementations may not 
satisfy this bound. However, the memory requirements of the reverse approach depend 
on the number of floating point operations needed to compute /, and thus ( |2.3D can be 
violated. A careful comparison between the reverse approach and the techniques described 
below would be of interest. 

In this section we consider two methods for computing the gradient of a partially sep- 
arable function via (|3.1| ). In the compressed AD approach, automatic differentiation tools 
are used to compute a compressed form of the Jacobian matrix of the extended function 
/b, while in the sparse AD approach, automatic differentiation tools are used to compute 
a sparse representation of the Jacobian matrix of the extended function. 

In the compressed AD approach we assume that the sparsity pattern of the Jacobian 
matrix fsi^) is known. Given the sparsity pattern, we partition the columns of the Jacobian 
matrix into groups of structurally orthogonal columns, that is, columns that do not have 
a nonzero in the same row position. Given a partitioning of the columns into p groups 
of structurally orthogonal columns, we determine the Jacobian matrix by computing the 
compressed Jacobian matrix /^'(a;)^, where V G M'^^P. There is a column of V for each 
group, and Vi^j 7^ only if the ith column of fsix) is in the j'th group. Software for this 
partitioning problem ||ll| defines the groups with an array ngrp that sets the group for each 
column. 

The extended Jacobian can be determined from the compressed Jacobian matrix fE'{x)V 
by noting that if column j is in group fc, then 

{eijE{x)Vek) = VijdijfEix). 

Thus dijfsix) can be recovered directly from the compressed Jacobian matrix. 

We note that for many sparsity patterns, the number of groups p needed to determine 
A G jg'^x" with a partitioning of the columns is small and independent of n. In all cases 
there is a lower bound of p > Pm- We also know jl^ that if a matrix A can be permuted 
to a matrix with bandwidth band{A) , then p < band{A) . 

The sparse AD approach uses a sparse data representation, usually in conjunction with 
dynamic memory allocation, to carry out all intermediate derivative computations. At 
present, the SparsLinC library in ADIFOR is the only automatic differentiation tool 
with this capability. The main advantage of the sparse AD approach over the compressed 
AD approach is that no knowledge of the sparsity pattern is required. On the other hand, 
the sparse AD approach is almost always slower, and can be significantly slower on vector 
machines. 

In an optimization setting, a hybrid approach |^ is the best approach. With this strat- 
egy, the sparse AD approach is used to obtain the sparsity pattern of the Jacobian matrix 
of the extended function at the starting point. See Section ^ for additional information on 
techniques for computing the sparsity pattern of the extended function. Once the sparsity 
pattern is determined, the compressed AD approach is used on all other iterations. The 
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hybrid approach is currently the best approach to compute gradients of partiahy separable 
functions, and is used in all solvers installed on the NEOS Server. 

We conclude this section with some recent results on using the sparse AD approach to 
compute the gradients of partially separable functions drawn from the MINPACK-2 Q collec- 
tion of test problems. We selected ten problems; the first five problems are finite element 
formulations of variational problems, while the last five problems are systems of nonlin- 
ear equations derived from collocation or difference formulations of systems of differential 
equations. 



Table provides the value of Pm for the ten problems in our performance results. For 
each of the problems we used three values of n, usually n G {1/4,1,4} • 10^, to observe 
the trend in performance as the number of variables increases. The results were essentially 
independent of the number of variables, so our results are indicative of the performance 
that can be expected in large-scale problems. 

Table 3.1: Data for MINPACK-2 test problems 





PJB 


MSA 


ODC 


ssc 


gl2 


FIG 


SFD 


lER 


SFI 


FDC 


Pm 


5 


4 


4 


4 


5 


9 


14 


17 


5 


13 



We want to show that the bounds ( p.4| ) for VIt,g holds for these problems. For these 
results we used the sparse approach to compute the Jacobian matrix /^'(x) of the extended 



function, and then computed the gradient of / with (3.1). For each problem we computed 
the ratio ki, where 

r{V/(x)} = Ki/5,,maxr{/(x)}. 

Table |3.2| presents the quartiles for ki obtained on a Pentium 3 (500 MHz clock, 128 MB 
of memory) with the Linux operating system. 

Table 3.2: Quartiles for ki on the MINPACK-2 problems 

min qi q2 qs max 
~T3 2.8 4.5 5^3 7.8 



The results in Table |3^ show that the bound ( p^ ) for Qt,g holds for the MINPACK-2 problems, 
with Ki small. 

These results are consistent with the results in Q, where it was shown that ki E [3, 15] 
on a SPARC-10 for another set of test problems drawn from the MINPACK-2 collection. Note 
that in Q the ratio ki was computed with pm replaced by the number of columns p in the 
matrix V. Since p > Pm, the ratios in Table ^ would decrease if we replaced Pm by p. The 
advantage of using p^^i is that the ratio ki is then dependent only on the structure of the 
function. 
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4 Computing Hessian Matrices 



We have already shown how automatic differentiation tools can be used to compute the 
gradient of a partially separable function. We now discuss the tools that are needed to 



compute the Hessian of a partially separable so that the requirements (2^) on computing 
time and (|2.3| ) on memory are satisfied. 

The techniques that we propose require the sparsity pattern of the Hessian matrix 
and that the Hessian-vector products V^/(x)t> be available. In our numerical results we 
approximate the Hessian- vector product with a difference of gradient values, but in future 
work we expect to compute Hessian- vector products with ADIFOR. 

We now show how to compute the sparsity pattern of the Hessian matrix from the 
sparsity pattern of /^'(x). We define the sparsity pattern of a matrix-valued mapping 
A -.R" M"^" in a neighborhood N{xo) of a point xq by 

5{.4(xo)} = |(i,j) :a„(x)^0, x G iV(xo)|. (4.1) 

We are interested in the sparsity pattern of the extended Jacobian and the Hessian matrix 
of a partially separable function / : i-^ IR in a region V of the form 

V = {x eW"- : xi < X < Xu} . 

Given x € T>, we evaluate the sparsity pattern S {/^'(x)} by computing /b'(xo), where xq 
is a random, small perturbation of xq, for example, 



xo = (1 + e)xo + e, |e| G [10~^, 10' 



'4i 



Then we can reliably let S {/^'(x)} be the set of (i, j) such that dijfE{xo) 7^ 0. We should 
not obtain the sparsity pattern of the Jacobian matrix by evaluating Je' at the starting 
point Xq of the optimization process because this point is invariably special, and thus the 
sparsity pattern of the Jacobian matrix is unlikely to be representative. 

The technique that we have outlined for determining the sparsity pattern is used by 
the solvers in the NEOS Server and has proved to be quite reliable. The sign of £ must be 
chosen so that xq G T>, and special care must be taken to handle the case when x; and x^ 
agree in some component. 

Given the sparsity pattern of the Jacobian matrix of the extended function, we determine 
the sparsity pattern for the Hessian V^/(x) of the partially separable function / via 

5{v2/(x)}c5{/,'(x)^/,'(x)}. (4.2) 

Note that ( |4.2| ) is valid only in terms of the definition (^]^) for a sparsity pattern. For 
example, if / : M'^ 1-^ M is defined by 

fix) = m^2) 

for a function ((> such that ^'(0) ^ 0, then (9i,2/(0) ^ 0, but dif{0) = ^2/(0) = 0. However, 
(4.2) holds because d2f{x) ^ and dif{x) ^ in a neighborhood of the origin. 
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In most cases equality holds in (4^). This happens, in particular, if / does not depend 
linearly on the variables, and 



U5{vVfc(x)}c5{vV(x)}. (4.3) 



fc=i 

If / depends linearly on some variables, say. 



then equality does not hold in ( ^.2| ). Assumption ( |4.3| ) implies that there is no cancellation 
in the computation of the Hessian V^/(x). This assumption can fail in some cases, for 
example, when /i = — /2, but holds in most cases. 



Since we are able to estimate the sparsity pattern of the Hessian matrix via (4.2), we 
could use the compressed AD approach described in Section]^ to compute the Hessian matrix 
from a compressed Hessian S/'^f{x)V. However, these techniques ignore the symmetry of 
the Hessian matrix and thus may require an unnecessarily large number of columns p in the 
matrix V. For example, an arrowhead matrix requires p = n symmetry is ignored, but 
p = 2 otherwise. 

Powell and Toint [p^ j were the first to show that symmetry can be used to reduce the 
number p of columns in the matrix V. They proposed two methods for determining a 
symmetric matrix A from a compressed matrix AV. In the direct method the unknowns 
in A are determined directly from the elements in the compressed matrix AV. In this 
method unknowns are determined independently of each other. In the substitution method 
the unknowns are determined in a given order, either directly or as a linear combination of 
elements that have been previously determined. 

These definitions of direct and substitution methods are precise but do not readily yield 
algorithms for determining symmetric matrices. Coleman and More |14| and Coleman and 



Cai [^] extended by interpreting the problem of determining symmetric matrices in 
terms of special graph coloring problems. This work led to new algorithms and a deeper 
understanding of the estimation problem. 

Software for the symmetric graph coloring problem is available ||l^ for both direct and 
substitution methods. Numerical results in |14] suggest that a direct method yields a 20% 



improvement over methods that disregard symmetry, and that the substitution method 
yields about a 30% reduction over the direct method. 



We use Algorithm iA to compute the Hessian matrix from a user-supplied extended 
function /e- This algorithm uses static memory allocation so that it is first necessary to 
determine the number of nonzeros in /^'(a^o) by computing /e'(xo) by rows, but not storing 
the entries. Once this is done, we allocate space for /b'(xo) and compute /b'(xo) and 



the sparsity pattern. Another interesting aspect of Algorithm [4.1| is that we compute the 
number of nonzeros in f J {xq)^ f e {xq) directly from the sparsity pattern of /b'(xo). In 
view of (^]^) , we then have an accurate idea of the amount of memory needed to store the 
Hessian matrix. The final step is to compute the Hessian matrix from the the compressed 
Hessian matrix V^/(xo)l^ by either a direct or a substitution method. 
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o Evaluate /e(xo) and obtain m = size/£;(a;o)- 

o Compute nnz{/£;'(xo)}. 

o Allocate space for /^'(xo). 

o Compute the sparsity pattern ^{/^'(xo)}. 

o Compute nnz{/£;'(xo)^/E'(xo)}. 

o Allocate space for V^/(a;o) 

o Compute V^/(xo) from the compressed Hessian matrix V'^f{xo)V. 

Algorithm 4.1: Computing the Hessian matrix for a partially separable function. 

We consider both direct and substitution methods to determine the Hessian matrix from 
the compressed Hessian. In both cases we are interested in the ratio K2, where 

T{V'f{x)} = K2plT{f{x)}, 

since this provides a measure of the cost of evaluating the Hessian matrix relative to the 
cost of the function. The K2 quartiles for both direct and substitution methods on the 
MINPACK-2 problems used in Section |3| appear in Table 4.1. 



Table 4.1: Quartiles for K2 on MINPACK-2 problems 



Method 


min 


Qi 


92 


Q3 


max 


Direct 


1.6 


5.1 


11.2 


15.2 


46.4 


Subsitution 


1.5 


4.1 


9.0 


12.5 


30.2 



Direct and substitution methods usually require more than pjvj gradient evaluations to 
determine the Hessian matrix, and thus the increase in the value of K2 relative to ki in 



Table 3.2 was expected. Still, it is reassuring that the median value of K2 is reasonably 
small. The largest values of K2 are due to one of the problems; if this problem is eliminated, 
then the maximal value drops by at least a factor of two. In general, problems with the 
longest computing times yield the smallest values of K2 since these problems tend to mask 
the overhead in the automatic differentiation tools and in determining the Hessian matrix. 
Moreover, these results are based on using a gradient evaluation that relies on sparse au- 
tomatic differentiation tools; the use of the hybrid approach mentioned in Section ^ should 
reduce K2 substantially. 
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